Variables, Samples, Population, Data
How are two or more variables related?
Examples:
Independent Variable → Dependent Variable
Education → Income
Two variables are associated if knowing the value of one of them will help to predict the value of the other.
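One common numerical summary of linear association is the Pearson correlation coefficient. The source does not name a specific measure, so this choice is an assumption. The sketch below uses made-up education and income values; the function name `pearson_r` and all numbers are illustrative only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: ranges from -1 (negative) to +1 (positive)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, not real data.
education = [8, 10, 12, 14, 16]   # years of schooling
income = [20, 24, 31, 30, 40]     # income in arbitrary units

r = pearson_r(education, income)
print(f"r = {r:.2f}")  # a value near +1 suggests a strong positive association
```

A value of r close to zero would mean that knowing one variable helps little in predicting the other with a straight line, although a nonlinear association could still exist.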
Example:
Life Expectancy and Urbanization
Life Expectancy
We will examine UN data on average life expectancy for 214 countries.
We want to know if urbanization has a positive or negative relationship with life expectancy.
The Data
Life Expectancy vs Urbanization
Correlates of Life Expectancy
Scatterplot 1
A causal relationship entails three elements:
Causal relationships can be stipulated in hypotheses.
Relationships between variables can be stated in hypotheses.
A hypothesis is an explicit statement about the relationship between phenomena that formalizes the researcher’s informed guess.
People tend to adopt political viewpoints similar to their parents.
Democracies are more likely to engage in trade with one another.
Authoritarian regimes are more likely to violate human rights.
Countries where property rights are protected tend to have higher levels of development.
Definitions of concepts should be:
Concepts should strike a balance between the specific and the abstract.
Population – complete enumeration of some set of interest
To learn about the population, a sample is often studied
Sampling is the process of selecting a subset from the population
Sampling is used to estimate characteristics of the full population
Aim: Ensure sample is representative
Requirement: Know your population
Dominant approach: probability sampling
Representative sample – If repeated, the sample’s features would match those of the population on average
Probability sampling reduces sample selection bias and makes the sample representative on average
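A small simulation can illustrate what "representative on average" means. This is a minimal sketch, assuming a synthetic population of 1,000 generated values (not real data) and simple random sampling without replacement. The names `population` and `draw_sample_mean` are hypothetical.

```python
import random
import statistics

# Hypothetical population: 1,000 synthetic values (not real data).
rng = random.Random(42)
population = [rng.gauss(70, 10) for _ in range(1000)]
population_mean = statistics.mean(population)


def draw_sample_mean(pop, n, rng):
    """Mean of one simple random sample of size n, drawn without replacement."""
    return statistics.mean(rng.sample(pop, n))


# Repeat the sampling many times and average the sample means.
sample_means = [draw_sample_mean(population, 50, rng) for _ in range(2000)]
average_of_sample_means = statistics.mean(sample_means)

print(f"Population mean:         {population_mean:.2f}")
print(f"Average of sample means: {average_of_sample_means:.2f}")
```

Any single sample mean can miss the population mean, but the average over many repeated samples should land close to it.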
Categorical
Numerical
\[ \{X_1, X_2, X_3, \dots, X_N\} = \{X_i\}_{i=1,\dots,N} \]
\[ \{45.38333, 68.28611, 57.53013, \dots, 77.04861\} = \{X_i\}_{i=1,\dots,N} \]
Example:
- Life expectancy = 59.75
- Level of urbanization = 66.4
Then the data point is:
\[ X = [66.4, 59.75] \]
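In code, the same data point could be stored with named fields so the order of the two values stays explicit. This is a sketch only; the class name `Observation` and its field names are assumptions introduced for illustration.

```python
from typing import NamedTuple


class Observation(NamedTuple):
    """One data point; field order follows X = [urbanization, life expectancy]."""
    urbanization: float
    life_expectancy: float


x = Observation(urbanization=66.4, life_expectancy=59.75)

print(list(x))            # [66.4, 59.75]
print(x.life_expectancy)  # 59.75
```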
Social scientists use statistical analyses to test theories driven by carefully thought-out hypotheses.
Hypotheses are falsifiable claims about the world.
Hypotheses connect dependent variables to independent variables.
- Dependent variables: outcomes or things we want to explain
- Independent variables: factors that help explain the dependent variable
Hypothesis:
An increase in X (independent variable) leads to an increase in Y (dependent variable).
Democratization Hypothesis:
More economic development is associated with higher levels of democracy.
To test this, we collect data on X and Y.
Units of analysis are the entities where our theory applies (e.g., countries, individuals, firms).
When we collect the data, we input it into a spreadsheet, a tabular format.
This becomes a dataset.
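A tabular dataset has one row per unit of analysis and one column per variable. The sketch below writes such a table as CSV and reads it back. The first row reuses the urbanization and life expectancy values from the earlier example; the other rows and all country labels are made-up placeholders, not real UN figures.

```python
import csv
import io

# Each row is one unit of analysis (here, a country).
# "Country A" reuses the earlier example values; the other rows are
# hypothetical placeholders, not real data.
rows = [
    {"country": "Country A", "urbanization": 66.4, "life_expectancy": 59.75},
    {"country": "Country B", "urbanization": 30.0, "life_expectancy": 55.0},
    {"country": "Country C", "urbanization": 85.0, "life_expectancy": 78.0},
]
columns = ["country", "urbanization", "life_expectancy"]

# Write the table in spreadsheet-style CSV form.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(rows)
dataset_csv = buffer.getvalue()
print(dataset_csv)

# Reading it back gives one record per unit and one field per variable.
dataset = list(csv.DictReader(io.StringIO(dataset_csv)))
```

Note that `csv.DictReader` returns every field as a string, so numeric columns need converting (for example with `float`) before analysis.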
In this example, there appears to be a positive relationship between X and Y.
- Not all high-X observations have high Y
- Not all low-X observations have low Y
To evaluate the relationship, we fit a line that best approximates the pattern in the data.
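One standard way to fit such a line is ordinary least squares (OLS), which picks the intercept and slope that minimize the sum of squared vertical distances to the points. The source does not say which fitting method is used, so OLS is an assumption here, and the data values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical values: a positive overall pattern, although the
# observation with x = 4 has a lower y than the one with x = 3.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

intercept, slope = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

A positive slope indicates a positive relationship on average, even when individual observations deviate from the line.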
Each country’s data is a point in a scatter plot.
If we measure three variables (e.g., life expectancy, urbanization, education),
we get a 3D point cloud.
A time series is a sequence of observations of one variable indexed by time:
\[ \{X_1, X_2, X_3, \dots, X_T\} = \{X_t\}_{t=1,\dots,T} \]
This is depicted as a 2D scatter.
Time is one variable, and the value of interest is another.
So, each point in the time series is a pair: (time, value).
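The (time, value) pairing can be sketched in a few lines. The years and values below are hypothetical placeholders for a single unit, not real data.

```python
# Hypothetical yearly values for a single unit (placeholders, not real data).
years = [2000, 2001, 2002, 2003]
values = [70.1, 70.4, 70.2, 70.9]

# Each observation X_t becomes a (time, value) pair.
time_series = list(zip(years, values))

# T is the number of time periods observed.
T = len(time_series)

for t, value in time_series:
    print(t, value)
```

Plotting the first element of each pair on the horizontal axis and the second on the vertical axis gives the 2D picture of the series.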
Time-series cross-section (panel) data observe several units over several time periods:
Balanced Panel = Every unit is observed in every time period. No missing time points for any unit.
Unbalanced Panel = Some units are missing observations for some time periods.
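Whether a panel is balanced can be checked by comparing the number of observed (unit, period) pairs with the number of units times the number of periods. This is a minimal sketch; the function names `is_balanced` and `missing_observations`, the unit labels, and the years are all hypothetical.

```python
def is_balanced(panel):
    """Return True if every unit is observed in every time period.

    `panel` is an iterable of (unit, period) pairs, one per observation.
    """
    observed = set(panel)
    units = {unit for unit, _ in observed}
    periods = {period for _, period in observed}
    return len(observed) == len(units) * len(periods)


def missing_observations(panel):
    """List the (unit, period) pairs that are absent from the panel."""
    observed = set(panel)
    units = {unit for unit, _ in observed}
    periods = {period for _, period in observed}
    return sorted((u, p) for u in units for p in periods if (u, p) not in observed)


# Hypothetical panels: two units, two periods.
balanced = [("A", 2000), ("A", 2001), ("B", 2000), ("B", 2001)]
unbalanced = [("A", 2000), ("A", 2001), ("B", 2000)]

print(is_balanced(balanced))             # True
print(is_balanced(unbalanced))           # False
print(missing_observations(unbalanced))  # [('B', 2001)]
```

The set of periods is taken from the data itself, so a period that is missing for every unit would not be detected by this check.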
Popescu (JCU): Variables, Samples, Population, Data